Users can run all cells in this notebook to obtain the desired outputs. Visualisations created with matplotlib will already appear; however, to reproduce the interactive plots, the notebook must be run on a local machine. Before proceeding, two modules need to be installed from the command line. These include:
Moreover, due to the large dataset, several processes took hours to run. In these cases, I have saved the required objects to the 'pickle_files' folder and reloaded them when required. The original code is included, but has been placed in docstrings so that it will not be executed.
In its current form, running all cells should take around 1 minute 15 seconds. All cells can now be run.
# Import necessary modules
# Some of these modules will require installation from the command line (e.g. dash, vaderSentiment etc.)
import numpy as np
import pandas as pd
import glob
import datetime as dt
import math
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from pandas.plotting import register_matplotlib_converters
import seaborn as sns
import time
import random
import json
import pickle
import csv
import networkx as nx
from scipy import stats
from operator import itemgetter
from IPython.display import display, HTML
from PIL import Image  # PIL's Image (not IPython's) is used for image loading below
import re
from sklearn.metrics import roc_curve, auc
# The modules below require installation from the command line
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
# The modules below are required to run code that is currently within docstrings - if you would
# like to run this code, these modules must be installed from the command line
'''
import tweepy
from langdetect import detect
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
'''
%matplotlib inline
init_notebook_mode()
# Allows side by side dataframe/images in the specified cell
CSS = """
div.cell:nth-child(8) .output {
flex-direction: row;
}
"""
HTML('<style>{}</style>'.format(CSS))
During my first year of university in Cape Town, I distinctly remember the sound of gunshots and stun grenades from outside the doors of our residence hall. These were the measures taken by the South African Police Service (SAPS) to control increasingly violent protests. Having been part of this defining moment in South African education has motivated me to analyse the protests from a statistical perspective.
The protests began in mid-October 2015 and focused on attaining two primary goals:
Protests started at the University of the Witwatersrand and spread to the University of Cape Town and Rhodes University. Within several weeks, the protests were country-wide, ultimately resulting in a national education crisis with an estimated $44.25 million in property damage alone. Images of protests at two universities during the crisis are shown below.
Having lived among the leaders of the #FeesMustFall movement in Leo Marquard Hall, I recall Twitter playing a major role in mobilising the youth, co-ordinating protests and providing a platform for debate surrounding the topic. This notebook delves deeper into the events of the 2015/2016 education crisis from a data perspective.
| University of Cape Town | Wits University |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
The data to be analysed in this notebook consists of three datasets (drawn from two sources):
The table below contains information on the core and benchmark data. Variables removed during the cleaning phase are not included.
| Dataset | Tweet count (before clean) | Tweet count (after clean) | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 | Variable 7 | Variable 8 | Variable 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Core | 447204 | 352841 | date | username | replies | retweets | favorites | text | mentions | hashtags | permalink |
| Benchmark | 112223 | 27342 | created_at | text | favorites | retweets | in_reply_to | | | | |
Core and benchmark tweet volumes will be used to investigate whether a correlation exists between Twitter activity and protest action. Next, the core data will be aggregated to gain further insight into the movement's key participants. Lastly, the actual tweet text from the core data is analysed to ascertain whether attitudes towards the movement were positive or negative.
While the qualitative nature of the data makes descriptive statistics less useful, a summary table and plot are presented below to give a better idea of the volume of tweets in the dataset. Each data point is the tweet frequency for a given hour of the day. On average, the dataset contains around 15,000 tweets per hour of the day, enough data to draw insightful inferences.
sum_stats = pd.read_pickle('pickle_files/sum_stats')
display(sum_stats)
boxplot = Image.open('images/summary_boxplot.png')
boxplot.thumbnail((400,800))
display(boxplot)
CSS = """
div.cell:nth-child(33) .output {
flex-direction: row;
}
"""
HTML('<style>{}</style>'.format(CSS))
This section includes explanations and code with details on how the various datasets were scraped.
Twitter has restricted free developer accounts from accessing tweets (by text and tags) more than 7 days old. Moreover, the paid API service limits users to 100 tweets per day. These limitations are significant when attempting to analyse more than 350,000 tweets. For this reason, alternative scraping methods were used to obtain the core data. Marquisvictor's GetOldTweets repository is used to scrape all tweets from the command line. In essence, this algorithm automates manual scrolling through Twitter: given the required information, it scrapes Twitter data and metadata directly from the browser.
After installing the relevant packages, all tweets with the #FeesMustFall tag are saved into 4 csv files. Varying periods are used for each csv file to ensure that the scrape is not large enough to lead to a termination of the request. This procedure is carried out as indicated in the image below. Scraping the entire date range took approximately 21 hours.

# Read core data into memory from the scraped csv files
path = r'../GregAdamMeyer'
all_files = glob.glob(path + "/*.csv")
csv_li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    csv_li.append(df)
df = pd.concat(csv_li, axis=0, ignore_index=False)
df.head(3)
-- will be referred to as the benchmark data
This notebook proceeds to use the Twitter API to scrape tweets from random users who tweeted with the #FeesMustFall tag in the data above, in order to draw conclusions about the engagement surrounding #FeesMustFall. Unlike the core data, tweets older than 7 days can be scraped using the API, provided the search parameter is the user's screen name rather than the tweet text. During the cleaning phase, only tweets from at least 4 months after the protests are kept, to ensure this data is not related to #FeesMustFall and is hence a good benchmark against which to compare the core data. Scraping the data below took approximately 1 hour.
# Scrape auxiliary data from the Twitter API and read it into memory
# Load credentials from json file
with open("../GregAdamMeyer/twitter_api.json", "r") as file:
    secrets = json.load(file)
api_key = secrets['CONSUMER_KEY']
api_secret_key = secrets['CONSUMER_SECRET']
access_token = secrets['ACCESS_TOKEN']
access_token_secret = secrets['ACCESS_SECRET']
'''
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
'''
def get_tweets(screen_name):
    tweets = []
    # Initial request
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    tweets.extend(new_tweets)
    # Save the id of the oldest tweet less one to avoid duplication
    oldest = tweets[-1].id - 1
    # Extract tweets until there are none left
    while len(new_tweets) > 0:
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        tweets.extend(new_tweets)
        oldest = tweets[-1].id - 1
    # Transform the tweets into rows that can be written to a csv file
    outtweets = [[tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count,
                  tweet.in_reply_to_screen_name] for tweet in tweets]
    # Write to csv
    with open('../2019mt-st445-project-GregAdamMeyer/random_users/'
              + '%s_tweets.csv' % screen_name, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["created_at", "text", "favorites", "retweets",
                         "in_reply_to"])
        writer.writerows(outtweets)
# Obtain a list of unique users who tweeted with the #FeesMustFall tag
users = set(df['username'])
# Take 50 of the users above, get their most recent tweets and save them to csv files
'''
for count, user in enumerate(users):
    get_tweets(user)
    if count == 50:
        break
'''
# Read the benchmark data into memory from the csv files
path = r'../GregAdamMeyer/random_users'
user_files = glob.glob(path + "/*.csv")
csv_l = []
for user_file in user_files:  # avoid shadowing the `users` set defined above
    df_rnd = pd.read_csv(user_file, index_col=None, header=0)
    csv_l.append(df_rnd)
df_rnd = pd.concat(csv_l, axis=0, ignore_index=False)
df_rnd.head(3)
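The `max_id` pagination used in `get_tweets` above (request a page, then repeat with `max_id` set to one below the oldest id seen) can be sketched against a stub timeline. `FakeTweet` and `fake_user_timeline` are hypothetical stand-ins for tweepy's Status objects and `api.user_timeline`, used only to illustrate the request/decrement loop:

```python
class FakeTweet:
    """Minimal stand-in for a tweepy Status object (only .id is needed here)."""
    def __init__(self, tweet_id):
        self.id = tweet_id

def fake_user_timeline(count=3, max_id=None):
    """Stub timeline: tweets with ids 10 down to 1, newest first, paged by max_id."""
    ids = [i for i in range(10, 0, -1) if max_id is None or i <= max_id]
    return [FakeTweet(i) for i in ids[:count]]

tweets = []
page = fake_user_timeline(count=3)   # initial request
tweets.extend(page)
while page:                          # keep paging until a request comes back empty
    oldest = tweets[-1].id - 1       # id just below the oldest tweet seen so far
    page = fake_user_timeline(count=3, max_id=oldest)
    tweets.extend(page)

ids = [t.id for t in tweets]         # 10, 9, ..., 1 with no duplicates
```

Decrementing by one before the next request is what prevents the oldest tweet of one page reappearing as the newest tweet of the next.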
The following steps are carried out during the initial data clean:
Different versions of the core dataframe are required for different visualisations of the data. The initial data clean is done in this section, and further ad hoc data pivots, groupbys and other cleaning methods are performed throughout the notebook.
# Clean core data
# First drop the duplicates in each scraped csv due to overlapping scrape dates
# Subset by permalink as these links will be unique
df.drop_duplicates(keep = 'first', inplace = True, subset = 'permalink')
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df_replies = df[~df['to'].isnull()] # keep the replies in a different dataframe
df = df[df['to'].isnull()] # only want tweets that aren't replies
df.drop(columns=['to', 'geo', 'id'], inplace = True) # drop unwanted columns
df.sort_values(by = 'date' ,inplace = True)
df = df.reset_index()
df.drop(columns='index', inplace = True) # drop extra index column
display(df.head(3))
# df will remain unchanged and act as the core data - manipulations and pivots will be
# performed on copies of this dataframe
The benchmark data is cleaned in a similar manner to the core data. However, due to the contrasting output format as a result of using the API, as well as the purpose for which this data is to be used, two additional steps are taken:
# Clean benchmark tweet data
df_rnd['created_at'] = pd.to_datetime(df_rnd['created_at'])
df_rnd.head(15)
# Only want tweets from well after the protest to ensure that the bulk of
# these tweets are not related to the #FeesMustFall topic - this will allow
# for a more accurate comparison
df_rnd = df_rnd[df_rnd['created_at'] > dt.datetime(2017,2,1,0,0,0)]
df_rnd.drop_duplicates(inplace = True)
df_rnd = df_rnd[df_rnd['in_reply_to'].isnull()] # don't want tweets that are replies
# Dataset contains retweets - need to remove retweets to obtain only user original tweets
df_rnd = df_rnd.reset_index()
# Create a column with the first two letters of the tweet
df_rnd['first_2_letters'] = df_rnd['text'].astype(str).str[0:2]
# Remove retweets
df_rnd = df_rnd[df_rnd['first_2_letters']!='RT']
df_rnd = df_rnd.reset_index()
df_rnd.drop(columns=['index', 'first_2_letters', 'level_0'], inplace = True)
df_rnd.tail(3)
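The retweet filter above builds a helper column of the first two letters. A sketch of an equivalent, slightly more direct filter using pandas' `str.startswith`, shown on a toy frame (the toy data is hypothetical; the real filter would act on `df_rnd`'s `text` column):

```python
import pandas as pd

# Toy frame shaped like df_rnd (same 'text' column name).
toy = pd.DataFrame({'text': ['RT @user: someone else said this',
                             'my own original tweet',
                             'RT @other: another retweet']})

# Rows whose text starts with 'RT' are retweets; keep the rest.
originals = toy[~toy['text'].astype(str).str.startswith('RT')]
```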
This notebook proceeds to shape the core data for network analysis. First, an undirected network is created. This is in the form of a list where each element consists of two connected nodes in set format. Nodes/users are considered to be connected in a #FeesMustFall context if the users have interacted via mentions or replies. A network with 112011 edges is obtained.
The following steps were taken to clean and shape the data for network analysis:
# Create network of nodes and edges
# First, create a network consisting of all users
# Trim the network later on to aid visualisation
'''
# Subset tweets that have mentions
df_nx = df[~df['mentions'].isnull()][['username', 'mentions']]
df_nx.reset_index(inplace = True)
df_nx.drop(columns = 'index', inplace = True)
def parse_mentions(mentions):
    # Parses all the mentions for a specific tweet
    # as a list and removes the @ symbol (docstrings
    # not used here as the entire block of code is
    # wrapped in a docstring)
    m_list = mentions.split()
    for i in range(len(m_list)):
        m_list[i] = m_list[i].lstrip('@')
    return m_list
# Cycle through all user/mention combinations and add them
# to the ntwrk list if they are not already there
ntwrk = []
for entry in df_nx.itertuples():
    for each_mention in parse_mentions(entry.mentions):
        n_edge = {entry.username, each_mention}
        if n_edge not in ntwrk:
            ntwrk.append(n_edge)
# Next, cycle through all user replies and add them to the ntwrk
# list if they are not already there
df_replies = df_replies[['username', 'to']]
for entry in df_replies.itertuples():
    n_edge = {entry.username, entry.to}
    if n_edge not in ntwrk:
        ntwrk.append(n_edge)
# Remove people who mention/reply to themselves
ntwrk_no_self_mentioners = []
for i in ntwrk:
    if len(i) == 2:
        ntwrk_no_self_mentioners.append(i)
ntwrk = ntwrk_no_self_mentioners
# The ntwrk list takes some time to generate so we just save it
# as a pickle file - the code below can be deleted if you would
# prefer to run the code
with open('pickle_files/ntwrk.pickle', 'wb') as handle:
    pickle.dump(ntwrk, handle, protocol=pickle.HIGHEST_PROTOCOL)
'''
with open('pickle_files/ntwrk.pickle', 'rb') as handle:
    ntwrk = pickle.load(handle)
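Much of the long run time above comes from the `if n_edge not in ntwrk` check, which is linear in the list length. A sketch of the same dedup-and-drop-self-mentions logic using a set of frozensets (hashable, undirected edges), with a hypothetical `interactions` list standing in for the parsed mention/reply pairs:

```python
# Hypothetical (username, mentioned/replied-to user) pairs standing in for the
# tweet-level interactions parsed above.
interactions = [('alice', 'bob'), ('bob', 'alice'),   # same edge seen twice
                ('carol', 'dave'), ('carol', 'dave'),
                ('eve', 'eve')]                       # self-mention, to be dropped

edges = set()
for a, b in interactions:
    if a != b:                        # drop self-mentioners up front
        edges.add(frozenset((a, b)))  # frozenset: undirected and hashable, so the
                                      # set deduplicates in O(1) per edge

ntwrk = [set(edge) for edge in edges]  # back to the list-of-sets format used above
```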
The #FeesMustFall movement kicked off after the announcement that university fees for 2016 would increase by 10.5%. The tweet below summarises how many students felt. Much of the momentum the protests gained can be attributed to these feelings.

Firstly, this section gives a brief overview of the unfolding of events. Secondly, it explores whether the prominence of the #FeesMustFall movement and periods of intense protest were linked to the volume of tweets over time.
# Plot volume of tweets by date
df_man = df.copy()  # work on a copy so the core df remains unchanged
df_man['date'] = df_man['date'].dt.floor('T')  # remove seconds - allows for better plot visualisation
df_man['date_temp'] = [i.date() for i in df_man['date']]
volume = (df_man.groupby('date_temp')['username'].count())
register_matplotlib_converters()
fig, ax = plt.subplots(figsize=(18, 6))
plt.plot(volume,'r')
plt.title('Number of #FeesMustFall tweets over time', fontsize = 15)
ax.set_facecolor('whitesmoke')
plt.xlabel('Date')
plt.ylabel('Tweet Frequency')
plt.xlim([dt.date(2015,9,1), dt.date(2016,11,15)])
# Plot dashed lines to show periods of intense protests
plt.axvline(dt.date(2015, 10, 8), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2015, 10, 29), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2016, 9, 19), color='b', linestyle='dashed', linewidth=1)
plt.axvline(dt.date(2016, 10, 28), color='b', linestyle='dashed', linewidth=1)
plt.show()
The plot above gives initial insight into the periods of peak Twitter activity. The dashed blue lines represent the start and end dates of heightened protests for each respective year. The link between Twitter activity and protest actions is explored further in the next plot.
# Create interactive volume plot
volume_df = pd.DataFrame(volume)
volume_df = volume_df.rename({'username': 'tweet_count'} ,axis = 'columns')
volume_df['date'] = volume_df.index
volume_df['year'] = volume_df['date'].map(lambda x: x.year)
volume_df['month'] = volume_df['date'].map(lambda x: x.month)
# We now create the points for the significant dates to be marked on the plotly line graph
signif_dates = [dt.date(2015,9,14),dt.date(2015,10,8),dt.date(2015,10,14),
dt.date(2015,10,19),dt.date(2015,10,20),dt.date(2015,10,21),
dt.date(2015,10,23),dt.date(2015,11,1),dt.date(2016,9,19),
dt.date(2016,10,10),dt.date(2016,10,19),dt.date(2016,11,1)]
for day in signif_dates:  # add significant dates with zero tweet count to dataframe
    if day not in volume_df.index:
        volume_df.loc[day] = [0, day, day.year, day.month]
volume_df.sort_values(by = 'date' ,inplace = True)
scatter_df = volume_df[volume_df.index.isin(signif_dates)]
volume_df[volume_df.index.isin([dt.date(2015,11,1),dt.date(2016,10,15)])]
fig = px.line(volume_df, x = volume_df.index, y = 'tweet_count',
title='Number of #FeesMustFall tweets over time',
color = 'year', hover_data = ['tweet_count'],
labels = {'year':'Year', 'tweet_count': 'Tweet Count',
'x':'Date'})
actual_text = ['<br>Signs of student unrest in KZN</br>',
'<br>First major protest </br> Presidential task team announced',
'<br>Birth of the #FeesMustFall movement </br> Systematic shutdown of \
South African universities <br>Protests escalate</br>',
'<br>Protests are countrywide</br>Courts grant interdicts against students <br>\
Police and students engage in violent confrontations</br>',
'<br>6% fee increase announced </br>Students reject this proposition',
'<br>Parliamentary grounds are breached </br>Students controlled with stun grenades,\
teargas and riot shields<br>Multiple arrests take place</br>',
'<br>0% fee increase announced </br>Protests reach their peak <br>Riots in Pretoria\
- police vehicles are burned</br>',
'<br>Protests start losing momentum </br>Lectures take place online\
<br>Students given the option to defer exams</br>',
'<br>8% fee increase for 2016 announced </br>Movement regains momentum\
<br> Universities shut down and move to online lectures </br>',
'<br>Rubber bullets, stun grenades and smoke grenades fired at students\
attempting to enter the Great Hall at Wits</br>',
'<br>Two security guards beaten by students at UCT </br>Private security\
companies contracted to protect the campus',
'<br>Steel structure exam venue constructed on the UCT rugby field </br>\
Exam venue secured by private security and canine units']
hovertext = ['<b>' + str(day) + '</b>' + actual_text[i] for i, day in enumerate(signif_dates)]
fig.add_trace(go.Scatter(
x=scatter_df.index,
y=scatter_df.tweet_count,
mode='markers',
name='markers',
hovertext=hovertext,
hoverinfo="text",
marker=dict(
color="green",
size = [10]*12
),
showlegend=False
))
fig.update_layout(xaxis_title="Date", yaxis_title="Tweet Frequency",
xaxis_range=[dt.date(2015,9,1), dt.date(2016,11,15)],
xaxis_rangeslider_visible=True)
fig.show()
Given information surrounding pivotal events in the movement, it becomes clear that there is a strong correlation between tweet frequency and protest action. Hover over the green circles above for information on events occurring on specific dates. You can also use the range slider at the bottom to restrict the dates in view.
The significant events corresponding to the green circles above are described in greater detail below:
From the above, it is clear that a spike in tweets with the tag #FeesMustFall coincided with periods of heavy protest action and defining events. The strong link between Twitter activity and protest action is apparent.
*University abbreviations used above:
UKZN - University of KwaZulu-Natal (Durban)
UCT - University of Cape Town (Cape Town)
Wits - University of the Witwatersrand (Johannesburg)
CPUT - Cape Peninsula University of Technology (Cape Town)
NMMU - Nelson Mandela Metropolitan University (Port Elizabeth)
# Plot tweet volume by time of day
# Group tweets by the time of day they were tweeted
df_man['time_temp'] = [i.time() for i in df_man['date']]
volume_time = df_man.groupby('time_temp').size()
# Create a list of the hourly count of tweets over the entire date range
hourly_count = []
freq = 0
for t, frequency in zip(volume_time.index, volume_time):  # `t` avoids shadowing the time module
    freq += frequency
    if t.minute == 59:
        hourly_count.append(freq)
        freq = 0
# Create a simple list of date time objects for every hour
hour_list = []
for hour in range(24):
    hour_list.append(dt.time(hour, 0, 0))
fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(hour_list, hourly_count, 'royalblue')
ax.set_xlabel("Time of day")
ax.set_ylabel("Tweet Frequency (by hour)")
ax.set_title("Tweet frequency across different times of the day")
ax.set_facecolor('whitesmoke')
ax.set_label('Minutely frequency')
ax2 = ax.twinx()
ax2.plot(volume_time, 'g')
ax2.set_ylabel('Tweet Frequency (by minute)')
ax2.set_xticks([dt.time(hour, 0, 0) for hour in range(24)])
green_patch = mpatches.Patch(color='g', label='Minutely frequency')
r_blue_patch = mpatches.Patch(color='royalblue', label='Hourly frequency')
plt.legend(handles=[r_blue_patch, green_patch], loc = 'upper left')
plt.show()
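The hourly totals above are accumulated minute by minute and flushed whenever the minute hits 59. An equivalent pandas approach groups the minutely series by the hour of each time key; a sketch on toy data shaped like `volume_time`:

```python
import datetime as dt
import pandas as pd

# Toy minutely counts keyed by datetime.time, shaped like volume_time above.
minutely = pd.Series([3, 4, 5],
                     index=[dt.time(0, 10), dt.time(0, 59), dt.time(1, 0)])

# Group by the hour of each time key and sum the counts within each hour.
hourly = minutely.groupby(lambda t: t.hour).sum()
```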
# ====================================================================================
# Create summary table and boxplot to be shown as descriptive statistics (output of
# this code is presented earlier in the notebook)
hourly_count = pd.DataFrame(hourly_count)
hourly_count = hourly_count.rename({0: 'Value'} ,axis = 'columns')
sum_stats = hourly_count.describe()
sum_stats.rename({'count': 'Number of data points (Hours in day)',
'mean': 'Average Tweets per hour',
'std': 'Standard Deviation', 'min': 'Min Tweets per hour',
'25%': 'Lower Quartile (25%)', '50%': 'Median (50%)',
'75%': 'Upper Quartile (75%)', 'max': 'Max Tweets per hour'},
axis='index', inplace = True)
sum_stats.rename({'replies': 'Replies', 'retweets': 'Retweets',
'favorites': 'Favourites'},
axis='columns', inplace = True)
# Save to pickle file so df can be shown in Dataset chapter
sum_stats.to_pickle('pickle_files/sum_stats')
hourly_count = hourly_count.rename({'Value': 'Hourly Volume'} ,axis = 'columns')
fig, ax = plt.subplots()
sns.boxplot(data=hourly_count[['Hourly Volume']],notch = True,color = 'g',
saturation = 1)
plt.title('Boxplot showing distribution of hourly tweets')
fig.tight_layout()
plt.savefig('images/summary_boxplot.png', dpi=300)
plt.close(fig)
The plot above illustrates hourly (left y-axis) and minutely (right y-axis) tweet volume. With the knowledge that the majority of protests occurred around midday, the obvious spike in Twitter activity during this time strengthens the claim that protests are heavily linked to Twitter activity.
# Plot tweet volume by engagement type for core data
engage = df_man.groupby('time_temp')[['replies', 'retweets', 'favorites']].sum()
labs = ['Replies', 'Retweets', 'Favorites']
plt.figure(figsize=(18,6))
plt.stackplot(engage.index, engage['replies'], engage['retweets'], engage['favorites'],
labels = labs, colors = ['blue', 'orangered', 'green'])
plt.legend(fontsize = 12)
plt.xticks(3600*np.arange(0, 26, 2), ('0:00', '2:00', '4:00', '6:00', '8:00',
'10:00', '12:00', '14:00', '16:00',
'18:00', '20:00', '22:00', '24:00'))
plt.xlabel('Time of day')
plt.ylabel('Engagement frequency')
plt.title('Engagement type proportions for #FeesMustFall data')
# Plot zoomed figure
engage_zoom = engage[2*60:2*60+30]
sub_axes = plt.axes([.185, .55, .25, .25]) # location on original graph
sub_axes.stackplot(engage_zoom.index, engage_zoom['replies'], engage_zoom['retweets'],
engage_zoom['favorites'], labels = labs, colors = ['blue', 'orangered', 'green'])
sub_axes.set_xticks([dt.time(2, 5*minute, 0) for minute in range(7)])
sub_axes.set_xlabel('Time of day')
sub_axes.set_ylabel('Engagement frequency')
plt.show()
# These large dataframes are no longer needed - delete them to free memory
del df_man
del volume
del volume_df
del scatter_df
The plot above analyses the split between likes, replies and retweets. Replies are a form of active engagement where the user is looking to enter a discussion, whereas likes and retweets function as a means to spread a message or express agreement with the tweet in question.
The plot makes clear that #FeesMustFall-related tweets are predominantly engaged with via retweets, reinforcing the active nature of the movement. The zoomed plot in the top-left corner demonstrates that this trend holds even at low-engagement hours of the day.
We proceed to compare the above plot to benchmark data in order to ascertain whether this trend is specific to the #FeesMustFall movement.
# Plot tweet volume by engagement type for benchmark data
# Remove seconds - allows for better plot visualisation
df_rnd['created_at'] = df_rnd['created_at'].dt.floor('T')
df_rnd['time_temp'] = [i.time() for i in df_rnd['created_at']]
rnd_tweets = df_rnd.groupby('time_temp')[['retweets', 'favorites']].sum()
# Group the stackplot data by hour as a result of the large fluctuation in the minutely data
rnd_tweets['index'] = rnd_tweets.index
rnd_tweets['hour'] = rnd_tweets['index'].apply(lambda x: x.hour)
rnd_tweets = rnd_tweets.groupby('hour')[['retweets', 'favorites']].sum()
rnd_tweets
rnd_labs = ['Retweets', 'Favorites']
plt.figure(figsize=(18,6))
plt.stackplot(rnd_tweets.index, rnd_tweets['retweets'],
rnd_tweets['favorites'], labels = rnd_labs,
colors = ['orangered', 'green'])
plt.xticks(0.958*np.arange(0, 26, 2), ('0:00', '2:00', '4:00', '6:00', '8:00',
'10:00', '12:00', '14:00', '16:00',
'18:00', '20:00', '22:00', '24:00'))
plt.legend()
plt.xlabel('Time of day')
plt.ylabel('Engagement frequency')
plt.title('Engagement type proportions for benchmark data')
plt.show()
Data in the plot above is taken from a sample of users who actively tweeted about #FeesMustFall. However, the data in question involves these users' tweets from 6 months after the protests until 2020/01/05, ensuring the majority of these tweets are not related to #FeesMustFall. This allows for meaningful comparison. Unfortunately, a premium API subscription is required to access the reply count (evidenced here), so replies are excluded as a metric in this plot.
Retweets form a much smaller proportion of engagement in the plot above relative to #FeesMustFall engagement. This suggests the retweet proportion of engagement on the #FeesMustFall topic is abnormally high. This reinforces the notion that the movement was primarily concerned with action, rather than discussion.
Analysis in this section suggests Twitter activity was linked to protest action. This is evidenced by:
This notebook proceeds to analyse interactions, influential users/institutions and clusters. Using the ntwrk dataset, the following key statistics within the network are calculated:
To best visualise the network structure, the network of the largest connected subgraph is depicted. This gives insight into interactions between the most active users during the #FeesMustFall campaign.
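The centrality measures computed below can be illustrated on a toy graph; `'hub'` here is a hypothetical stand-in for a highly connected account such as eNCA:

```python
import networkx as nx

# Toy graph: 'hub' is a hypothetical highly connected account.
G = nx.Graph()
G.add_edges_from([('hub', 'a'), ('hub', 'b'), ('hub', 'c'), ('a', 'b')])

centrality = nx.degree_centrality(G)        # degree / (n - 1)
closeness = nx.closeness_centrality(G)      # inverse average shortest-path distance
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through a node

# 'hub' touches every other node, so it tops all three measures here.
hub_top = max(centrality, key=centrality.get)
```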
# Perform network analysis on #FeesMustFall users
'''
# Create graph by adding edges from ntwrk
G = nx.Graph()
for node_1, node_2 in ntwrk:
    G.add_edge(node_1, node_2, weight=1)
degrees = [val for (node, val) in G.degree()]
nx.is_connected(G) # graph isn't connected so we look at connected subgraphs
largest_subgraph = max((G.subgraph(c) for c in nx.connected_components(G)), key=len)
# Took 24 hours to run below code because network is large and centrality calculations are
# computationally heavy
graph_centrality = nx.degree_centrality(largest_subgraph)
max_de = max(graph_centrality.items(), key=itemgetter(1)) # output: eNCA with value of 0.05
graph_closeness = nx.closeness_centrality(largest_subgraph)
max_clo = max(graph_closeness.items(), key=itemgetter(1)) # output: eNCA with value of 0.35
graph_betweenness = nx.betweenness_centrality(largest_subgraph, normalized=True, endpoints=False)
max_bet = max(graph_betweenness.items(), key=itemgetter(1)) # output: eNCA with value of 0.16
network_stats = pd.DataFrame({'Value':[nx.number_of_nodes(G),nx.number_of_edges(G),
np.max(degrees),np.min(degrees),np.mean(degrees),
stats.mode(degrees)[0][0],
nx.number_connected_components(G),
largest_subgraph.number_of_nodes(),
largest_subgraph.number_of_edges(),
nx.average_clustering(largest_subgraph),
nx.transitivity(largest_subgraph),
'eNCA/0.05','eNCA/0.35','eNCA/0.16']},
index = ['Number of nodes','Number of edges',
'Max degree','Min degree','Average degree',
'Most frequent degree',
'Number of connected components',
'Number of nodes in largest subgraph',
'Number of edges in largest subgraph',
'Clustering co-efficient (largest subgraph)',
'Transitivity (largest subgraph)',
'Node with highest degree centrality/value',
'Node with highest closeness centrality/value',
'Node with highest betweenness centrality/value'])
with open('pickle_files/network_stats.pickle', 'wb') as handle:
    pickle.dump(network_stats, handle, protocol=pickle.HIGHEST_PROTOCOL)
'''
with open('pickle_files/network_stats.pickle', 'rb') as handle:
    network_stats = pickle.load(handle)
# Plot the network of the largest subgraph
# Position nodes using Fruchterman-Reingold force-directed algorithm.
'''
node_and_degree = largest_subgraph.degree()
colors_central_node = ['red']
central_nodes = ['eNCA']
pos = nx.spring_layout(largest_subgraph, k=0.05)
fig = plt.figure(figsize = (20,20))
nx.draw(largest_subgraph, pos=pos, node_color=range(largest_subgraph.number_of_nodes()),
cmap=plt.cm.PiYG, edge_color="black", linewidths=0.3, node_size=60, alpha=0.6,
with_labels=False)
nx.draw_networkx_nodes(largest_subgraph, pos=pos, nodelist=central_nodes, node_size=300,
node_color=colors_central_node)
fig.savefig("images/full_net.png", bbox_inches='tight', dpi=600)
'''
display(network_stats)
net = Image.open("images/full_net.png")
net.thumbnail((420,420))
display(net)
The graphic above depicts the largest subgraph, centred around the eNCA node - details on the features of the graph are available in the table on the left. Unfortunately, there are too many nodes to visualise the network successfully, so alternative ways to visualise it are considered.
This notebook proceeds to focus on the network of influential users in the movement. The motivation is that the clustering coefficient is larger than the transitivity ratio, suggesting that low-degree nodes may form clusters around higher-degree nodes. This possibility is explored in the next few cells, with a particular focus on accounts affiliated with the respective South African universities.
# Analyse network of influential users
conn_dict = {}
for node_1, node_2 in ntwrk:
    if node_1 in conn_dict:
        conn_dict[node_1] += 1
    else:
        conn_dict[node_1] = 1
    if node_2 in conn_dict:
        conn_dict[node_2] += 1
    else:
        conn_dict[node_2] = 1
# Inspect users with more than 100 connections, then extract key participants
# in the protests who are affiliated with an institution (most are taken
# from this list)
for user in conn_dict:
    if conn_dict[user] > 100:
        print(user, end=' | ')
significant_users = ['TuksUPrising', 'FeesMustFall',
'RhodesMustFall','WitsFMF','RhodesSRC']
pop_ntwrk = []
for node_1, node_2 in ntwrk:
    if node_1 in significant_users or node_2 in significant_users:
        pop_ntwrk.append([node_1, node_2])
print('\n\nThe significant users to be analysed are:\n' , significant_users)
# Perform network analysis on high profile #FeesMustFall users
# Create graph by adding edges from ntwrk
G2 = nx.Graph()
for node_1, node_2 in pop_ntwrk:
    G2.add_edge(node_1, node_2, weight=1)
print("The graph has %d nodes with %d edges." % (nx.number_of_nodes(G2), nx.number_of_edges(G2)))
pos = nx.layout.spring_layout(G2)
#Create Edges
edge_trace = go.Scatter(
x=[],
y=[],
line=dict(width=0.5,color='#888'),
hoverinfo='none',
mode='lines')
for edge in G2.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
node_trace = go.Scatter(
x=[],
y=[],
text=[],
mode='markers',
hoverinfo='text',
marker=dict(
showscale=True,
colorscale='Electric',
reversescale=True,
color=[],
size=10,
colorbar=dict(
thickness=15,
title='Number of Node Connections',
xanchor='left',
titleside='right'
),
line=dict(width=2)))
for node in G2.nodes():
    x, y = pos[node]
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
# Add color to node points
for node, adjacencies in enumerate(G2.adjacency()):
    node_trace['marker']['color'] += tuple([len(adjacencies[1])])
    node_info = 'Name: ' + str(adjacencies[0]) + '<br># of connections: ' + str(len(adjacencies[1]))
    node_trace['text'] += tuple([node_info])
fig = go.Figure(data=[edge_trace, node_trace],
layout=go.Layout(
title='Network graph of high profile #FeesMustFall institutions',
titlefont=dict(size=16),
showlegend=False,
hovermode='closest',
margin=dict(b=20,l=5,r=5,t=40),
annotations=[ dict(
showarrow=False,
xref="paper", yref="paper",
x=0.005, y=-0.002 ) ],
xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
print('Hover over the graph below for information on respective nodes.')
fig
From the analysis above, several inferences can be made:
*RhodesMustFall is the UCT account, RhodesSRC is the Rhodes University account
UCT and Wits were pivotal to the #FeesMustFall movement, with strong links tying the two institutions together. UP and Rhodes University were also influential, albeit to a lesser extent.
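The claim that certain accounts were pivotal can be quantified with degree centrality, which networkx (imported above) provides directly. A minimal sketch on a toy edge list — the generic `userA`/`userB`/`userC` names are invented for illustration and are not drawn from the dataset:

```python
import networkx as nx

# Toy edge list standing in for pop_ntwrk (illustrative names only)
toy_edges = [('WitsFMF', 'userA'), ('WitsFMF', 'userB'),
             ('WitsFMF', 'RhodesMustFall'), ('RhodesMustFall', 'userC'),
             ('RhodesSRC', 'userC')]

G = nx.Graph()
G.add_edges_from(toy_edges)

# Degree centrality: each node's degree divided by (n - 1)
centrality = nx.degree_centrality(G)
ranked = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked[:3]:
    print(f'{name}: {score:.2f}')
```

On the toy graph, `WitsFMF` ranks first with centrality 3/5, mirroring how the hub institutions dominate the real network above.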
TextBlob is built on top of the NLTK library. It aids sentiment analysis by tokenising the tweet text and assigning it a polarity score in the range $[-1, 1]$. The classifier groups tweets into three classes: positive, negative and neutral.
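The scoring idea is lexicon-based: individual words carry polarities, and a tweet's score aggregates them. The miniature lexicon and averaging rule below are invented purely for illustration — they are not TextBlob's actual lexicon or aggregation logic:

```python
# Hypothetical miniature lexicon (not TextBlob's real one)
TOY_LEXICON = {'great': 0.8, 'good': 0.7, 'bad': -0.7, 'terrible': -1.0}

def toy_polarity(text):
    '''Average the polarity of known words; unknown words are
    ignored, as in a typical lexicon-based scorer.'''
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('free education is a great and good idea'))  # 0.75
print(toy_polarity('completely neutral statement'))             # 0.0
```

A tweet with no lexicon hits scores exactly zero, which is one reason so many tweets end up in the neutral class later on.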
vaderSentiment (Valence Aware Dictionary and sEntiment Reasoner) is another tool used to classify tweet sentiment. The process is similar to the one described above until the final step, where vaderSentiment uses different training data and a different model to assign sentiment polarity scores.
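vaderSentiment summarises each tweet with a compound score in $[-1, 1]$; its authors suggest treating scores of at least $0.05$ as positive and at most $-0.05$ as negative. A small sketch of that mapping — the $0.05$ band is VADER's recommended default, not a value fitted to this dataset:

```python
def vader_label(compound, threshold=0.05):
    '''Map a VADER compound score to a sentiment class using a
    symmetric neutral band around zero.'''
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

print(vader_label(0.66))   # Positive
print(vader_label(-0.3))   # Negative
print(vader_label(0.02))   # Neutral
```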
In order to analyse the performance of the two models, the sentiment of the first 100 tweets is manually labelled and compared with each model's predicted sentiment. The results are presented in the form of ROC curves.
Because the sentiment analysis involves three classes, ordinary binary classification metrics cannot be used directly. A one-vs-all multiclass technique is adopted instead: an ROC curve is plotted for each class, treating that class as the 'positive' class and all other classes as 'negative'. This technique is applied to both sentiment analysis models after restricting the data to English tweets.
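The one-vs-all idea reduces to relabelling: for each class in turn, its examples become 1 and everything else 0, after which an ordinary binary ROC can be computed. A minimal pure-Python sketch on invented toy labels and scores (sklearn's `roc_curve` with `pos_label`, used in the cells below, performs the same relabelling internally):

```python
def binary_roc_points(labels, scores, positive_class):
    '''One-vs-rest ROC: treat positive_class as 1, everything else
    as 0, then sweep each distinct score as a threshold.'''
    y = [1 if lab == positive_class else 0 for lab in labels]
    pos, neg = sum(y), len(y) - sum(y)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for yi, s in zip(y, scores) if yi == 1 and s >= t)
        fp = sum(1 for yi, s in zip(y, scores) if yi == 0 and s >= t)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

labels = ['Positive', 'Neutral', 'Positive', 'Negative', 'Neutral']
scores = [0.9, 0.1, 0.6, -0.8, 0.0]
print(binary_roc_points(labels, scores, 'Positive'))
```

The same function applied with `positive_class='Negative'` or `'Neutral'` yields the other two curves, which is exactly the loop structure of the plotting cells that follow.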
# Perform sentiment analysis
# All data is prepared in this cell - plots are completed in later cells
# Checking whether data is English and assigning sentiment polarity scores took
# around 3 hours to run
def clean_tweet(tweet):
    '''Cleans a tweet by removing mentions, links and special
    characters using regex substitution.'''
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ",
                           tweet).split())
def is_english(tweet):
    '''Returns True if the tweet is detected as English; returns False
    otherwise or if language detection raises an error.'''
    try:
        return detect(tweet) == 'en'
    except Exception:
        return False
# Create functions for TextBlob sentiment analysis
def get_tweet_sentiment_score(tweet):
    '''Takes a tweet as an input and returns its sentiment polarity score'''
    analysis = TextBlob(clean_tweet(tweet))
    return analysis.sentiment.polarity
def assign_tweet_sentiment(polarity, threshold):
    '''Takes 2 arguments:
    1. The polarity score for a tweet
    2. The threshold above which a tweet is considered to have positive polarity
    Returns the sentiment of the tweet (positive, negative or neutral)'''
    if polarity > threshold:
        return 'Positive'
    elif polarity == threshold:
        return 'Neutral'
    else:
        return 'Negative'
# Create function for vaderSentiment analysis
def compound_score(tweet):
    '''Takes a tweet as an input, cleans the tweet and returns
    the vaderSentiment compound polarity score'''
    analyzer = SentimentIntensityAnalyzer()
    tweet = clean_tweet(tweet)
    return analyzer.polarity_scores(tweet)['compound']
'''
# Replicate df to be manipulated
df_man = df.copy()
# Limit the analysis to tweets in English as the TextBlob library was trained on English data
df_man = df_man[df_man.text.apply(is_english)] # 22236 tweets are removed
# Perform TextBlob sentiment analysis
df_man['TB_sentiment_score'] = df_man.text.apply(get_tweet_sentiment_score)
# Perform vaderSentiment analysis
df_man['VS_sentiment_score'] = df_man.text.apply(compound_score)
df_man = df_man.reset_index()
df_man.drop(columns='index', inplace = True)
# Manually attribute sentiments for first 100 tweets to test accuracy
df_man['true_sentiment'] = 'n/a'
positive_pos = [1,3,4,5,6,7,9,10,11,13,15,17,18,
22,23,24,26,27,28,29,30,31,32,33,35,
37,40,46,50,55,56,57,58,59,66,
67,68,69,73,75,77,79,83,
87,88,89,95,96]
negative_pos = [12,85,98,99,100]
neutral_pos = [0,2,8,14,16,19,20,21,25,
34,36,38,39,41,42,43,44,45,47,
48,49,51,52,53,54,60,61,62,63,
64,65,70,71,72,74,76,78,80,81,82,84,
86,90,91,92,93,94,97]
df_man.loc[positive_pos, 'true_sentiment'] = 'Positive'
df_man.loc[negative_pos, 'true_sentiment'] = 'Negative'
df_man.loc[neutral_pos, 'true_sentiment'] = 'Neutral'
df_man.to_pickle('pickle_files/df_man')
'''
with open('pickle_files/df_man', 'rb') as handle:
    df_man = pickle.load(handle)
# Display sentiment columns
df_man.iloc[:3,[0,5,9,10,11]]
# Compute ROC curves for different techniques
classes = ['Positive', 'Negative', 'Neutral']
y_test_score = df_man.loc[:100][['true_sentiment', 'TB_sentiment_score', 'VS_sentiment_score']]
# Compute ROC curve and ROC area for each class (TextBlob)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in classes:  # one ROC curve per class
    fpr[i], tpr[i], _ = roc_curve(y_test_score.true_sentiment,
                                  y_test_score.TB_sentiment_score,
                                  pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])
# Plot of a ROC curve for a specific class
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(18, 6))
for i in classes:
    ax1.plot(fpr[i], tpr[i], label=i + ' (area = %0.2f)' % roc_auc[i])
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.05])
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC curve: TextBlob')
ax1.legend(loc="lower right")
# Compute ROC curve and ROC area for each class (vaderSentiment)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in classes:  # one ROC curve per class
    fpr[i], tpr[i], _ = roc_curve(y_test_score.true_sentiment,
                                  y_test_score.VS_sentiment_score,
                                  pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])
for i in classes:
    ax2.plot(fpr[i], tpr[i], label=i + ' (area = %0.2f)' % roc_auc[i])
ax2.plot([0, 1], [0, 1], 'k--')
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate')
ax2.set_title('ROC curve: vaderSentiment')
ax2.legend(loc="lower right")
plt.show()
It is important to note that the neutral ROC curve is not informative. Because values within a small range around a polarity score of zero are classified as neutral, the threshold restriction would have to be expressed as an absolute value, i.e. $|Threshold|<0.05$. For this reason, an ROC plot is not appropriate for illustrating the ideal threshold in this context; the curve is included only for completeness.
From the above, it can be seen that both methods perform similarly. Because the TextBlob method performs marginally better on the positive-sentiment curve, and because it is trained on a larger dataset, we adopt this classifier. It is applied to all tweets; results are illustrated in the following cell.
# Visualise sentiment split
# Group data to analyse sentiment split
df_man['TB_sentiment_pred'] = df_man.TB_sentiment_score.apply(assign_tweet_sentiment, args = (0,))
sentiment_count = df_man.groupby('TB_sentiment_pred')[['username']].count()
sentiment_count.rename({'username': 'count'},
axis='columns', inplace = True)
sent_fig = make_subplots(rows=1, cols=2, specs=[[{'type':'pie'}, {'type':'xy'}]])
colors = ['gold', 'mediumturquoise', 'darkorange']
sent_fig.add_trace(go.Pie(labels=['Negative', 'Neutral', 'Positive'],
values=[sent for sent in sentiment_count['count']]),
row=1, col=1)
sent_fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
marker=dict(colors=colors, line=dict(color='#000000', width=2)))
sent_fig.update_layout(title={'text': '#FeesMustFall Sentiment Split',
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
sent_fig.add_trace(go.Bar(x=['Negative', 'Neutral', 'Positive'],
y=[sent for sent in sentiment_count['count']],
marker_color=colors, showlegend = False,
hoverinfo = 'y'),
row=1, col=2)
sent_fig.show()
The above plots suggest that more than half of all tweets are neutral, while positive sentiment towards the movement is almost double the negative sentiment. Finally, a word cloud is computed below:
# Create a string of all words in the tweets
text = " ".join([i for i in df_man['text']])
# Remove common, irrelevant words
stopwords = set(STOPWORDS)
more_stopwords = ['twitter','pic','https','ly','ow','FeesMustFall',
'co','za','need','bit','say','want','come','fb',
'fees','must','fall','RT','instagram','South','Africa']
for each_word in more_stopwords:
    stopwords.add(each_word)
wc = WordCloud(background_color="white", stopwords=stopwords, scale = 4,
width=2500, height=1000).generate(text)
plt.figure(num=None, figsize=(15, 5), dpi=80)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
The word cloud above reaffirms the split between positive and negative sentiment towards the movement. Phrases such as 'National Shutdown' and 'Free Education', likely used by those in favour of the movement, occupy large areas of the cloud, while negative words like 'fragmented' appear smaller, reinforcing that a larger proportion of sentiment towards the movement is positive.
While most of the tweets are neutral commentary on the topic, a larger proportion of the polar tweets encompass a positive outlook on the movement as opposed to a negative one.
From the above analysis, three primary conclusions can be reached: